Exploring Non-Homogeneity and Dynamicity of High Scale Cloud through Hive and Pig
نویسندگان
چکیده
The trace consists of cell information of about 29 days spanning across 700k jobs. This paper deals with statistical analysis of this cluster trace. Since the size of trace is very large, Hive which is a Hadoop distributed file system (HDFS) based platform for querying and analysis of Big data, has been used. Hive was accessed through its Beeswax interface. The data was imported into HDFS through HCatalog. Apart from Hive, Pig which is a scripting language and provides abstraction on top of Hadoop was used. To the best of our knowledge the analytical method adopted by us is novel and has helped in gaining several useful insights. Clustering of jobs and arrival time has been done in this paper using K-means++ clustering followed by analysis of distribution of arrival time of jobs which revealed weibull distribution while resource usage was close to zip-f like distribution and process runtimes revealed heavy tailed distribution.
منابع مشابه
GENMR: Generalized Query Processing through Map Reduce In Cloud Database Management System
Big Data, Cloud computing, Cloud Database Management techniques, Data Science and many more are the fantasizing words which are the future of IT industry. For all the new techniques one common thing is that they deal with Data, not just Data but the Big Data. Users store their various kinds of data on cloud repositories. Cloud Database Management System deals with such large sets of data. For p...
متن کاملA multi-scale convolutional neural network for automatic cloud and cloud shadow detection from Gaofen-1 images
The reconstruction of the information contaminated by cloud and cloud shadow is an important step in pre-processing of high-resolution satellite images. The cloud and cloud shadow automatic segmentation could be the first step in the process of reconstructing the information contaminated by cloud and cloud shadow. This stage is a remarkable challenge due to the relatively inefficient performanc...
متن کاملPig vs Hive: Benchmarking High Level Query Languages
This article presents benchmarking results of two benchmarking sets (run on small clusters of 6 and 9 nodes) applied to Hive and Pig running on Hadoop 0.14.1. The first set of results were obtainted by replicating the Apache Pig benchmark published by the Apache Foundation on 11/07/07 (which served as a baseline to compare major Pig Latin releases). The second results were obtained by applying ...
متن کاملA Comparison of Hadoop Tools for Analyzing Tabular Data
The paper describes the application of Hadoop modules: MapReduce, Pig and Hive, for processing and analyzing large amounts of tabular data acquired from a computer simulation of heat transfer in bio tissues. The Apache Hadoop is an open source environment for storing and analyzing BigData. It was installed on a cluster of six computing nodes, each with four cores. The implemented MapReduce job ...
متن کاملHive Collective Intelligence for Cloud Robotics: A Hybrid Distributed Robotic Controller Design for Learning and Adaptation
The recent advent of Cloud Computing, inevitably gave rise to Cloud Robotics. Whilst the field is arguably still in its infancy, great promise is shown regarding the problem of limited computational power in Robotics. This is the most evident advantage of Cloud Robotics, but, other much more significant yet subtle advantages can now be identified. Moving away from traditional Robotics, and appr...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- CoRR
دوره abs/1503.06600 شماره
صفحات -
تاریخ انتشار 2015